step time

Unsupervised Ordering for Maximum Clique

Min, Yimeng, Gomes, Carla P.

arXiv.org Artificial Intelligence

We propose an unsupervised approach for learning vertex orderings for the maximum clique problem by framing it within a permutation-based framework. We transform the combinatorial constraints into geometric relationships so that the ordering of vertices aligns with the clique structure. By integrating this clique-oriented ordering into branch-and-bound search, we improve search efficiency and reduce the number of computational steps. Our results demonstrate how unsupervised learning of vertex orderings can enhance search efficiency across diverse graph instances. We further study generalization across different graph sizes.
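As a rough illustration of why vertex ordering matters here, the sketch below runs a standard branch-and-bound maximum clique search that expands candidates in a supplied order; the function name and the toy graph are our own, and the ordering would in the paper's setting come from the learned model rather than be hand-picked:

```python
def max_clique_bnb(adj, order):
    """Branch-and-bound maximum clique search expanding vertices in a given
    order; a clique-aligned order tends to find large cliques early, so the
    size-based bound prunes more of the search tree.
    `adj` maps each vertex to its set of neighbours; `order` is a
    permutation of the vertices."""
    best = []

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = list(clique)
        for i, v in enumerate(candidates):
            # Bound: even taking every remaining candidate cannot beat `best`.
            if len(clique) + len(candidates) - i <= len(best):
                return
            # Keep only later candidates adjacent to v.
            expand(clique + [v],
                   [u for u in candidates[i + 1:] if u in adj[v]])

    expand([], sorted(adj, key=order.index))
    return best
```

For example, on a triangle {0, 1, 2} with a pendant vertex 3, any ordering recovers the triangle, but an ordering that front-loads clique members reaches the incumbent sooner and prunes earlier.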


Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training

Du, Xianzhi, Gunter, Tom, Kong, Xiang, Lee, Mark, Wang, Zirui, Zhang, Aonan, Du, Nan, Pang, Ruoming

arXiv.org Artificial Intelligence

Mixture-of-Experts (MoE) enjoys performance gains by increasing model capacity while keeping computation cost constant. When comparing MoE to dense models, prior work typically adopts the following setting: 1) use FLOPs or activated parameters as a measure of model complexity; 2) train all models to the same number of tokens. We argue that this setting favors MoE, as FLOPs and activated parameters do not accurately measure the communication overhead in sparse layers, leading to a larger actual training budget for MoE. In this work, we revisit these settings by adopting step time as a more accurate measure of model complexity, and by determining the total compute budget under the Chinchilla compute-optimal settings. To efficiently run MoE on modern accelerators, we adopt a 3D sharding method that keeps the dense-to-MoE step time increase within a healthy range. We evaluate MoE and dense LLMs on a set of nine 0-shot and two 1-shot English tasks, as well as MMLU 5-shot and GSM8K 8-shot, across three model scales at 6.4B, 12.6B, and 29.6B. Experimental results show that even under these settings, MoE consistently outperforms dense LLMs on the speed-accuracy trade-off curve with meaningful gaps. Our full model implementation and sharding strategy have been released at https://github.com/apple/axlearn.
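The budgeting argument can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the common Chinchilla rule of thumb of roughly 20 training tokens per parameter and the standard C ≈ 6·N·D FLOPs approximation; both helper functions and their parameter values are illustrative assumptions, not the paper's exact accounting:

```python
def chinchilla_budget(n_params, tokens_per_param=20):
    """Chinchilla-style compute-optimal budget: train on roughly
    `tokens_per_param` tokens per parameter; approximate total training
    FLOPs with the rule of thumb C ~= 6 * N * D."""
    tokens = n_params * tokens_per_param
    flops = 6 * n_params * tokens
    return tokens, flops

def matched_steps(dense_step_time_s, moe_step_time_s, dense_steps):
    """Under a step-time (wall-clock) budget rather than a FLOPs budget,
    a model that is slower per step gets proportionally fewer steps."""
    budget_s = dense_step_time_s * dense_steps
    return int(budget_s / moe_step_time_s)
```

For instance, an MoE variant whose step is 25% slower than the dense baseline gets only 80% as many optimizer steps under the same wall-clock budget, which is exactly the overhead that a FLOPs-matched comparison hides.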


ScaleFold: Reducing AlphaFold Initial Training Time to 10 Hours

Zhu, Feiwen, Nowaczynski, Arkadiusz, Li, Rundong, Xin, Jie, Song, Yifei, Marcinkiewicz, Michal, Eryilmaz, Sukru Burc, Yang, Jun, Andersch, Michael

arXiv.org Artificial Intelligence

AlphaFold2 has been hailed as a breakthrough in protein folding: it can rapidly predict protein structures with lab-grade accuracy. However, its implementation does not include the necessary training code. OpenFold is the first trainable public reimplementation of AlphaFold. The AlphaFold training procedure is prohibitively time-consuming and sees diminishing returns from scaling to more compute resources. In this work, we conducted a comprehensive analysis of the AlphaFold training procedure based on OpenFold and identified inefficient communication and overhead-dominated computation as the key factors preventing AlphaFold training from scaling effectively. We introduced ScaleFold, a systematic training method that incorporates optimizations specifically targeting these factors. ScaleFold successfully scaled AlphaFold training to 2080 NVIDIA H100 GPUs with high resource utilization. In the MLPerf HPC v3.0 benchmark, ScaleFold finished the OpenFold benchmark in 7.51 minutes, an over $6\times$ speedup over the baseline. For training the AlphaFold model from scratch, ScaleFold completed pretraining in 10 hours, a significant improvement over the seven days required by the original AlphaFold pretraining baseline.


A Model Predictive Capture Point Control Framework for Robust Humanoid Balancing via Ankle, Hip, and Stepping Strategies

Kim, Myeong-Ju, Lim, Daegyu, Park, Gyeongjae, Park, Jaeheung

arXiv.org Artificial Intelligence

The robust balancing capability of humanoid robots against disturbances is considered one of the crucial requirements for their practical mobility in real-world environments. In particular, many studies have been devoted to efficient implementation of the three balance strategies inspired by human balancing (the ankle, hip, and stepping strategies) to endow humanoid robots with human-level balancing capability. In this paper, a robust balance control framework for humanoid robots is proposed. Firstly, a novel Model Predictive Control (MPC) framework is proposed for Capture Point (CP) tracking control, enabling the integration of the ankle, hip, and stepping strategies within a single framework. Additionally, a variable weighting method is introduced that adjusts the weighting parameters of the Centroidal Angular Momentum (CAM) damping control over the time horizon of the MPC to improve balancing performance. Secondly, a hierarchical structure of the MPC and a stepping controller is proposed, allowing for step time optimization. The robust balancing performance of the proposed method is validated through extensive simulations and real robot experiments. Furthermore, superior balancing performance is demonstrated, particularly in the presence of disturbances, compared to a state-of-the-art Quadratic Programming (QP)-based CP controller that employs the ankle, hip, and stepping strategies. The supplementary video is available at https://youtu.be/CrD75UbYzdc
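For readers unfamiliar with the Capture Point being tracked here, the sketch below computes it under the standard Linear Inverted Pendulum model; this is the textbook quantity, not the paper's MPC formulation, and the function name is our own:

```python
import math

def capture_point(com_pos, com_vel, com_height, g=9.81):
    """Instantaneous Capture Point under the Linear Inverted Pendulum
    model: xi = x + x_dot / omega, with omega = sqrt(g / z_c).
    Stepping onto the capture point brings the pendulum to rest, which is
    why CP tracking unifies ankle, hip, and stepping responses."""
    omega = math.sqrt(g / com_height)  # natural frequency of the pendulum
    return com_pos + com_vel / omega
```

A CP-tracking controller drives this point toward a reference inside the support polygon; when it cannot (e.g., after a large push), the stepping strategy relocates the support polygon to the capture point instead.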


Baechi: Fast Device Placement of Machine Learning Graphs

Jeon, Beomyeol, Cai, Linda, Shetty, Chirag, Srivastava, Pallavi, Jiang, Jintao, Ke, Xiaolan, Meng, Yitao, Xie, Cong, Gupta, Indranil

arXiv.org Artificial Intelligence

Machine learning graphs (or models) can be challenging or impossible to train when devices have limited memory or models are large. To split a model across devices, learning-based approaches remain popular. While these produce model placements that train fast on data (i.e., have low step times), learning-based model parallelism is itself time-consuming, taking many hours or days to create a placement plan of operators on devices. We present the Baechi system, the first to adopt an algorithmic approach to the placement problem for running machine learning training graphs on small clusters of memory-constrained devices. We integrate our implementation of Baechi into two popular open-source learning frameworks: TensorFlow and PyTorch. Our experimental results using GPUs show that: (i) Baechi generates placement plans 654× to 206K× faster than state-of-the-art learning-based approaches, and (ii) the step (training) time of Baechi-placed models is comparable to expert placements in PyTorch, and only up to 6.2% worse than expert placements in TensorFlow. We prove mathematically that our two algorithms are within a constant factor of the optimal. Our work shows that, compared to learning-based approaches, algorithmic approaches face different challenges in adapting to machine learning systems, but they offer provable bounds and significant performance benefits.
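To illustrate the flavor of algorithmic placement, here is a toy memory-constrained earliest-finish-time heuristic in the spirit of Baechi's scheduling-based placers; the data format, tie-breaking, and cost model (per-op compute time only, no communication) are simplifying assumptions of ours, not Baechi's actual algorithms:

```python
def greedy_place(ops, devices):
    """Toy memory-constrained placement: assign each op, in topological
    order, to the device that can finish it earliest among devices with
    enough free memory.
    `ops`: list of (name, compute_time_s, memory) in topological order.
    `devices`: dict mapping device name to available memory."""
    free_mem = dict(devices)
    ready_at = {d: 0.0 for d in devices}   # when each device is next idle
    placement = {}
    for name, t, mem in ops:
        feasible = [d for d in devices if free_mem[d] >= mem]
        if not feasible:
            raise MemoryError(f"no device can hold op {name}")
        # Earliest finish time among memory-feasible devices.
        d = min(feasible, key=lambda dev: ready_at[dev] + t)
        placement[name] = d
        free_mem[d] -= mem
        ready_at[d] += t
    return placement
```

Because the heuristic runs one linear pass over the graph, its planning cost is negligible next to learning-based placers that retrain a policy per model, which is the source of the orders-of-magnitude planning speedups reported above.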